Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language
Abstract
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of pitch accent in Japanese TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain. We investigate the effects of the features captured by PnG BERT on TTS by modifying the fine-tuning condition to determine the conditions helpful for inferring pitch accents. We manipulate the content of the PnG BERT features from being text-oriented to speech-oriented by changing the number of layers fine-tuned during TTS training. In addition, we teach PnG BERT pitch accent information by fine-tuning it with tone prediction as an additional downstream task. Our experimental results show that the features captured by PnG BERT during pretraining contain information helpful for inferring pitch accent, and that PnG BERT outperforms the baseline Tacotron on accent correctness in a listening test.
Similar resources
Automatic pitch accent prediction for text-to-speech synthesis
Determining pitch accents in a sentence is a key task for a text-to-speech (TTS) system. We describe some methods for pitch accent assignment which make use of features that contain information about a complete phrase or sentence, in contrast to most previous work which has focused on using features local to a syllable or word. Pitch accent prediction is performed using three different technique...
Toward Language-independent Text-to-speech Synthesis
Text-to-speech (TTS) synthesis is becoming a fundamental part of any embedded system that has to interact with humans. Language-independence in speech synthesis is a primary requirement for systems that are not practical to update, as is the case for most embedded systems. Because current text-to-speech synthesis usually refers to a single language and to a single speaker (or at most a limited ...
CRF-based statistical learning of Japanese accent sandhi for developing Japanese text-to-speech synthesis systems
In Japanese, every content word has its own H/L pitch pattern when it is uttered in isolation, called its accent type. In a TTS system, this lexical information is usually stored in a dictionary and is referred to for prosody generation. When converting a written sentence to speech, however, this lexical H/L pattern is often changed according to the context, a phenomenon known as word accent sandhi. This accen...
A Markup Language for Text-to-speech Synthesis (Richard Sproat)
Text-to-speech synthesizers must process text, and therefore require some knowledge of text structure. While many TTS systems allow for user control by means of ad hoc ‘escape sequences’, there remains to date no adequate and generally agreed upon system-independent standard for marking up text for the purposes of synthesis. The present paper is a collaborative effort between two speech groups ...
Journal
Journal title: IEEE Journal of Selected Topics in Signal Processing
Year: 2022
ISSN: 1941-0484, 1932-4553
DOI: https://doi.org/10.1109/jstsp.2022.3190672